To run the analysis below, install the following packages: gvlma, car, glmnet, psych, pls, and boot.

First, we read in the data:

data.raw <- read.csv(file = "/Users/andreolli/Desktop/BodyFat.csv")
data.raw <- data.raw[, -c(1,3)]

The str function shows the data type of each column:

str(data.raw)
## 'data.frame':    252 obs. of  15 variables:
##  $ BODYFAT  : num  12.6 6.9 24.6 10.9 27.8 20.6 19 12.8 5.1 12 ...
##  $ AGE      : int  23 22 22 26 24 24 26 25 25 23 ...
##  $ WEIGHT   : num  154 173 154 185 184 ...
##  $ HEIGHT   : num  67.8 72.2 66.2 72.2 71.2 ...
##  $ ADIPOSITY: num  23.7 23.4 24.7 24.9 25.6 26.5 26.2 23.6 24.6 25.8 ...
##  $ NECK     : num  36.2 38.5 34 37.4 34.4 39 36.4 37.8 38.1 42.1 ...
##  $ CHEST    : num  93.1 93.6 95.8 101.8 97.3 ...
##  $ ABDOMEN  : num  85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
##  $ HIP      : num  94.5 98.7 99.2 101.2 101.9 ...
##  $ THIGH    : num  59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
##  $ KNEE     : num  37.3 37.3 38.9 37.3 42.2 42 38.3 39.4 38.3 41.7 ...
##  $ ANKLE    : num  21.9 23.4 24 22.8 24 25.6 22.9 23.2 23.8 25 ...
##  $ BICEPS   : num  32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
##  $ FOREARM  : num  27.4 28.9 25.2 29.4 27.7 30.6 27.8 29 31.1 30 ...
##  $ WRIST    : num  17.1 18.2 16.6 18.2 17.7 18.8 17.7 18.8 18.2 19.2 ...

Apart from AGE, which is an integer variable, all variables are floating-point. There are no obvious factor variables.

As exploratory analysis, we first plot the response against each predictor:

par(mfrow = c(2,2))
for(i in 1:14)
{
  plot(x = data.raw[,i + 1], y = data.raw[,1])
}

The scatterplots show that BODYFAT has an approximately linear relationship with several predictors, so a linear model is a natural first attempt. Several predictors show similar trends against BODYFAT, hinting that a linear model may suffer from multicollinearity, and a few points lie well away from the rest, suggesting possible outliers. Note also that AGE, the integer variable, trends approximately linearly with BODYFAT, which suggests we need not treat it as a factor.
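The visual impression of shared trends can be backed up numerically with a pairwise correlation matrix. A minimal sketch, using the built-in mtcars data as a stand-in (in the actual analysis, cor(data.raw[,-1]) would be computed instead):

```r
# Pairwise correlations among a few predictors; |r| close to 1 flags
# candidate collinear pairs. mtcars stands in for data.raw here.
cor.mat <- cor(mtcars[, c("wt", "disp", "hp")])
round(cor.mat, 2)
```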

We first fit a multiple linear regression of BODYFAT on all the other predictors and inspect the result:

model.raw <- lm(BODYFAT ~ ., data = data.raw)
summary(model.raw)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data.raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2573  -2.5919  -0.1031   2.9040   9.2754 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.519e+01  1.611e+01  -0.943   0.3467    
## AGE          5.688e-02  3.003e-02   1.894   0.0594 .  
## WEIGHT      -8.130e-02  4.989e-02  -1.630   0.1045    
## HEIGHT      -5.307e-02  1.034e-01  -0.513   0.6084    
## ADIPOSITY    6.101e-02  2.780e-01   0.219   0.8265    
## NECK        -4.450e-01  2.184e-01  -2.037   0.0427 *  
## CHEST       -3.087e-02  9.779e-02  -0.316   0.7526    
## ABDOMEN      8.790e-01  8.545e-02  10.286   <2e-16 ***
## HIP         -2.031e-01  1.371e-01  -1.481   0.1398    
## THIGH        2.274e-01  1.356e-01   1.677   0.0948 .  
## KNEE        -9.927e-04  2.298e-01  -0.004   0.9966    
## ANKLE        1.572e-01  2.076e-01   0.757   0.4496    
## BICEPS       1.485e-01  1.600e-01   0.928   0.3543    
## FOREARM      4.297e-01  1.849e-01   2.324   0.0210 *  
## WRIST       -1.479e+00  4.967e-01  -2.978   0.0032 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.996 on 237 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7342 
## F-statistic: 50.52 on 14 and 237 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model.raw)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

The adjusted R-squared is a fairly large 0.7342, the residual plot shows no obvious trend, and the normal Q-Q plot looks acceptable, but the last two diagnostic plots reveal unusual observations that should be removed. Several non-significant predictors suggest we may need variable selection. We can also test whether the model suffers from multicollinearity:

require(car)
## Loading required package: car
vif(model.raw)
##       AGE    WEIGHT    HEIGHT ADIPOSITY      NECK     CHEST   ABDOMEN 
##  2.250902 33.786851  2.256593 16.163444  4.430734 10.684562 13.346689 
##       HIP     THIGH      KNEE     ANKLE    BICEPS   FOREARM     WRIST 
## 15.158277  7.961508  4.828828  1.945514  3.674508  2.193390  3.379612

The model shows severe multicollinearity (several VIFs far exceed 10). Multicollinearity does not affect outlier diagnostics, so we deal with the unusual points first. Observation 182 has BODYFAT = 0, which is clearly an error and must be removed. The standardized-residual and residual-vs-leverage plots show several outliers and high-leverage points, and Cook's distance flags observation 42 as influential. Because removing many points at once can be misleading when influential points are present, we remove only points 42 and 182 and refit:

data.raw2 <- data.raw[-c(42, 182),]
model.raw2 <- lm(BODYFAT ~ ., data = data.raw2)
par(mfrow = c(2,2))
plot(model.raw2)
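Influence can also be checked numerically with cooks.distance instead of being read off the plot; a common rule of thumb flags observations with distance greater than 4/(n − k − 1), where k is the number of predictors. A minimal sketch, using a small model on mtcars as a stand-in for model.raw2:

```r
# Indices of observations whose Cook's distance exceeds 4/(n - k - 1).
# mtcars and this small model stand in for the body-fat data and model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
cutoff <- 4 / (nrow(mtcars) - length(coef(fit)) - 1)
which(cooks.distance(fit) > cutoff)
```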

Observation 39 is now close to influential; we remove it and refit:

data.raw3 <- data.raw[-c(39, 42, 182),]
model.raw3 <- lm(BODYFAT ~ ., data = data.raw3)
par(mfrow = c(2,2))
plot(model.raw3)

There are no longer any clearly influential points, and the standardized residuals show no obvious outliers, but the residuals-vs-leverage plot still shows quite a few high-leverage points, which we remove using the hat matrix. The following function plots the hat values along with the indices of the corresponding points:

hat.plot <- function(fit){
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit),main = "Index Plot of Hat Values")
  abline(h=c(2,3)*p/n,col="red",lty=2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))  
  # interactive: click points to select them; their names are returned when the plot is closed
}
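identify() only works in an interactive session (hence the `integer(0)` seen when the document is knitted). The same indices can be obtained non-interactively with the 2p/n cutoff that hat.plot draws as its first reference line; a sketch on mtcars as a stand-in for our models:

```r
# Non-interactive alternative to identify(): indices of points whose
# hat value exceeds 2p/n. mtcars stands in for the body-fat data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
p <- length(coefficients(fit))
n <- length(fitted(fit))
which(hatvalues(fit) > 2 * p / n)
```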

Find the high-leverage points:

hat.plot(model.raw3)

## integer(0)

The high-leverage points are 31, 36, 41, 54, 86, 106, 159, 163, 175, 206, 216, and 221. We remove them and refit:

data.raw4 <- data.raw[-c(39, 182, 42, 31, 36, 41, 54, 86, 106, 159, 163, 175, 206, 216, 221),]
model.raw4 <- lm(BODYFAT ~ ., data = data.raw4)
par(mfrow = c(2,2))
plot(model.raw4)

The obvious high-leverage points of model.raw3 have been removed; we check for high-leverage points once more:

hat.plot(model.raw4)

## integer(0)

Points 5, 74, 96, 172, 240, and 242 have high, though moderate, leverage; to be safe we remove them as well:

data.raw5 <- data.raw[-c(39, 182, 42, 31, 36, 41, 54, 86, 106, 159, 163, 175, 206, 216, 221, 5, 74, 96, 172, 240, 242),]
model.raw5 <- lm(BODYFAT ~ ., data = data.raw5)
par(mfrow = c(2,2))
plot(model.raw5)

par(mfrow = c(1,1))
hat.plot(model.raw5)

## integer(0)

There are now no obvious outliers, influential points, or high-leverage points. To summarize: observation 182 is erroneous; points 39 and 42 are influential; and points 5, 31, 36, 41, 54, 74, 86, 96, 106, 159, 163, 172, 175, 206, 216, 221, 240, and 242 have high leverage. After dropping all of these, we store the cleaned data in data and fit the multiple linear model on it (which is just model.raw5):

data <- data.raw5
model <- lm(BODYFAT ~ ., data = data)
summary(model)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5735 -2.8383 -0.2265  2.8896  9.2112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -79.38030   64.76839  -1.226   0.2217    
## AGE           0.05661    0.03216   1.760   0.0798 .  
## WEIGHT       -0.24337    0.18133  -1.342   0.1810    
## HEIGHT        0.89706    0.91556   0.980   0.3283    
## ADIPOSITY     1.46159    1.31156   1.114   0.2664    
## NECK         -0.32287    0.25493  -1.266   0.2067    
## CHEST        -0.09525    0.11402  -0.835   0.4044    
## ABDOMEN       0.83600    0.09384   8.909   <2e-16 ***
## HIP          -0.16197    0.16258  -0.996   0.3203    
## THIGH         0.20183    0.14944   1.351   0.1782    
## KNEE         -0.05637    0.26422  -0.213   0.8313    
## ANKLE         0.08080    0.36407   0.222   0.8246    
## BICEPS        0.19164    0.19883   0.964   0.3362    
## FOREARM       0.19918    0.33890   0.588   0.5573    
## WRIST        -1.30244    0.58491  -2.227   0.0270 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.969 on 216 degrees of freedom
## Multiple R-squared:  0.7159, Adjusted R-squared:  0.6975 
## F-statistic: 38.89 on 14 and 216 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model)

The adjusted R-squared is still fairly large, the residual plot shows no obvious trend, the independence, linearity, and homoscedasticity assumptions appear well satisfied, and there are no remaining obvious outliers. Normality seems slightly violated, but the residual distribution looks symmetric and not far from normal. We now check each of these claims formally, using the functions presented in Chapter 8 of R in Action:

Independence:

durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.06838182      1.855548   0.208
##  Alternative hypothesis: rho != 0

The Durbin-Watson test's p-value exceeds 0.05, so the independence assumption holds.

Homoscedasticity:

ncvTest(model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.001391821    Df = 1     p = 0.9702401

The score test's p-value is very high, so the constant-variance assumption holds.

Linearity, together with a check that the skewness and kurtosis of the errors are close to those of a normal distribution (the function is described on p. 181 of the Chinese edition of R in Action):

require(gvlma)
## Loading required package: gvlma
gvmodel <- gvlma(model)
summary(gvmodel)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5735 -2.8383 -0.2265  2.8896  9.2112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -79.38030   64.76839  -1.226   0.2217    
## AGE           0.05661    0.03216   1.760   0.0798 .  
## WEIGHT       -0.24337    0.18133  -1.342   0.1810    
## HEIGHT        0.89706    0.91556   0.980   0.3283    
## ADIPOSITY     1.46159    1.31156   1.114   0.2664    
## NECK         -0.32287    0.25493  -1.266   0.2067    
## CHEST        -0.09525    0.11402  -0.835   0.4044    
## ABDOMEN       0.83600    0.09384   8.909   <2e-16 ***
## HIP          -0.16197    0.16258  -0.996   0.3203    
## THIGH         0.20183    0.14944   1.351   0.1782    
## KNEE         -0.05637    0.26422  -0.213   0.8313    
## ANKLE         0.08080    0.36407   0.222   0.8246    
## BICEPS        0.19164    0.19883   0.964   0.3362    
## FOREARM       0.19918    0.33890   0.588   0.5573    
## WRIST        -1.30244    0.58491  -2.227   0.0270 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.969 on 216 degrees of freedom
## Multiple R-squared:  0.7159, Adjusted R-squared:  0.6975 
## F-statistic: 38.89 on 14 and 216 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model) 
## 
##                      Value p-value                Decision
## Global Stat        5.75761 0.21800 Assumptions acceptable.
## Skewness           0.05877 0.80845 Assumptions acceptable.
## Kurtosis           2.79973 0.09428 Assumptions acceptable.
## Link Function      2.36312 0.12423 Assumptions acceptable.
## Heteroscedasticity 0.53599 0.46410 Assumptions acceptable.

The results show that the linearity assumption is satisfied, and that the error distribution is symmetric with near-normal kurtosis, i.e. reasonably close to normal.

Normality:

shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.9909, p-value = 0.1588

The Shapiro-Wilk test gives p = 0.1588 > 0.05, so normality is not formally rejected, though the p-value is not especially large and the Q-Q plot showed a mild deviation. To see whether a transformation helps, we try a Box-Cox transformation of BODYFAT:

require(car)
summary(powerTransform(data$BODYFAT))
## bcPower Transformation to Normality 
##              Est Power Rounded Pwr Wald Lwr bnd Wald Upr Bnd
## data$BODYFAT    0.8509           1        0.564       1.1377
## 
## Likelihood ratio tests about transformation parameters
##                             LRT df         pval
## LR test, lambda = (0) 37.326539  1 9.991565e-10
## LR test, lambda = (1)  1.020027  1 3.125125e-01

powerTransform estimates a power of 0.85 for BODYFAT (note that the rounded power is 1, and the likelihood-ratio test does not reject lambda = 1); we fit the model on the transformed response:

model.trans <- lm(BODYFAT^0.85 ~ ., data = data)
summary(model.trans)
## 
## Call:
## lm(formula = BODYFAT^0.85 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6056 -1.5184 -0.1221  1.5221  4.8783 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.10272   36.11292  -1.249   0.2130    
## AGE           0.03190    0.01793   1.779   0.0766 .  
## WEIGHT       -0.14251    0.10111  -1.410   0.1601    
## HEIGHT        0.52793    0.51049   1.034   0.3022    
## ADIPOSITY     0.82813    0.73129   1.132   0.2587    
## NECK         -0.18414    0.14214  -1.295   0.1965    
## CHEST        -0.04842    0.06358  -0.762   0.4472    
## ABDOMEN       0.46450    0.05232   8.878 2.66e-16 ***
## HIP          -0.08934    0.09065  -0.986   0.3255    
## THIGH         0.12092    0.08332   1.451   0.1482    
## KNEE         -0.02956    0.14732  -0.201   0.8412    
## ANKLE         0.03987    0.20300   0.196   0.8445    
## BICEPS        0.11560    0.11086   1.043   0.2982    
## FOREARM       0.10817    0.18896   0.572   0.5676    
## WRIST        -0.70723    0.32613  -2.169   0.0312 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.213 on 216 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.6936 
## F-statistic:  38.2 on 14 and 216 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model.trans)

shapiro.test(model.trans$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.trans$residuals
## W = 0.99061, p-value = 0.1417

The Box-Cox transformation turns out not to help: the Shapiro-Wilk result is essentially unchanged. Still, using model as-is is defensible: as shown in linear regression theory, when the independence, linearity, and homoscedasticity assumptions hold and the error distribution is symmetric and close to normal, the linear model remains robust, so we keep model.

Next we address the multicollinearity. We consider three approaches: 1. variable selection via lasso regression; 2. principal component analysis; 3. partial least squares.

1. Variable selection via lasso regression. We use the glmnet package (written by the inventors of the lasso); an introduction is available at https://cosx.org/2016/10/data-mining-1-lasso/. cv.glmnet chooses a reasonable lambda, and hence a variable subset, by cross-validation.

require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
y = as.matrix(data[,1])
x = as.matrix(data[,-1])
fit = cv.glmnet(x = x, y = y, family = "gaussian", nlambda = 100, alpha = 1, standardize=TRUE, nfolds = 10)
print(fit$lambda.1se)
## [1] 0.4295899
print(log(fit$lambda.1se))
## [1] -0.8449242
plot(fit)

Each run of cv.glmnet gives a slightly different result because the folds are re-randomized; the output above is from a single run. Repeating it many times suggests that keeping three to four variables is reasonable. Next we determine which variables to keep:
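The retained variables can also be read directly off the fitted cv.glmnet object via its coefficient vector at lambda.1se. A sketch on simulated data (with the real data, the x and y defined above would be passed in):

```r
library(glmnet)  # assumes glmnet is installed
set.seed(1)
# Simulated stand-in: 100 observations, 5 predictors, only two truly active.
x.sim <- matrix(rnorm(100 * 5), 100, 5)
y.sim <- x.sim[, 1] + 0.5 * x.sim[, 2] + rnorm(100)
cvfit <- cv.glmnet(x.sim, y.sim, family = "gaussian", alpha = 1, nfolds = 10)
b <- coef(cvfit, s = "lambda.1se")     # sparse coefficient vector
rownames(b)[which(b[, 1] != 0)]        # names of the retained variables
```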

fit2 = glmnet(x = x, y = y, family = "gaussian", nlambda = 100, alpha = 1, standardize=TRUE)
plot(fit2, xvar = "lambda", label = TRUE)

print(fit2$jerr)
## [1] 0

We see that predictors 1, 3, 7, and 14 are retained; we store the corresponding columns in data.lasso and fit the model:

data.lasso <- data[,c(1, 2, 4, 8, 15)]
model.lasso <- lm(BODYFAT ~., data = data.lasso)
summary(model.lasso)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data.lasso)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3819 -2.9060 -0.2866  3.0046  8.8346 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.72103    8.22120   0.209 0.834371    
## AGE          0.04391    0.02316   1.896 0.059226 .  
## HEIGHT      -0.30631    0.12426  -2.465 0.014442 *  
## ABDOMEN      0.70203    0.03472  20.219  < 2e-16 ***
## WRIST       -1.54019    0.40955  -3.761 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.97 on 226 degrees of freedom
## Multiple R-squared:  0.7027, Adjusted R-squared:  0.6974 
## F-statistic: 133.5 on 4 and 226 DF,  p-value: < 2.2e-16
plot(model.lasso)

vif(model.lasso)
##      AGE   HEIGHT  ABDOMEN    WRIST 
## 1.261470 1.406395 1.590373 1.838827

The adjusted R-squared, 0.6974, is close to that of the full model, though AGE is not significant at the 0.05 level. The diagnostic plots show that all assumptions are well satisfied apart from the same slight deviation from normality. The variance inflation factors are small enough, so the multicollinearity is resolved. Note that the lasso does not guarantee the optimal model: replacing HEIGHT with WEIGHT makes AGE non-significant, and dropping AGE then yields a larger adjusted R-squared while the VIFs remain acceptable. We keep this model as an alternative candidate:

data.lasso.adjusted <- data[,c(1, 3, 8, 15)]
model.lasso.adjusted <- lm(BODYFAT ~., data = data.lasso.adjusted)
summary(model.lasso.adjusted)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data.lasso.adjusted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0714 -3.0115 -0.2233  2.9141  9.3720 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.46939    6.62993  -4.143 4.83e-05 ***
## WEIGHT       -0.08902    0.02321  -3.836 0.000162 ***
## ABDOMEN       0.87791    0.05534  15.863  < 2e-16 ***
## WRIST        -1.03242    0.42771  -2.414 0.016579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.969 on 227 degrees of freedom
## Multiple R-squared:  0.7015, Adjusted R-squared:  0.6976 
## F-statistic: 177.8 on 3 and 227 DF,  p-value: < 2.2e-16
plot(model.lasso.adjusted)

vif(model.lasso.adjusted)
##   WEIGHT  ABDOMEN    WRIST 
## 5.222052 4.042555 2.006588

2. Principal component analysis. We first standardize the data so that a principal component model can be built:

data.scaled <- as.data.frame(scale(data, center = TRUE, scale = TRUE))

Now we build the principal component model; the functions used are described in Chapter 14 of R in Action:

require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit

The fa.parallel function helps determine the number of principal components; its scree plot overlays three criteria for choosing that number:

fa.parallel(data.scaled[,-1], fa = "pc", n.iter = 1000, show.legend = FALSE, main = "Scree plot with parallel analysis")
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

The criteria suggest keeping two principal components; keeping three is also defensible. We take both the two- and three-component models as candidates.
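Principal-component regression itself can be sketched with base R: prcomp yields the component scores, and the response is regressed on them. A minimal sketch, with mtcars standing in for data.scaled (in the real analysis the scores would come from data.scaled[,-1]):

```r
# PCR sketch: regress the response on the first two PC scores of the
# standardized predictors. mtcars is a stand-in for the body-fat data.
pc <- prcomp(mtcars[, c("wt", "disp", "hp", "drat")], scale. = TRUE)
scores <- pc$x[, 1:2]               # first two component scores
fit.pcr <- lm(mtcars$mpg ~ scores)
summary(fit.pcr)$adj.r.squared
```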

principal(data.scaled[,-1], nfactors = 2, rotate = "none")
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##            PC1   PC2   h2    u2 com
## AGE       0.00  0.85 0.73 0.274 1.0
## WEIGHT    0.98 -0.06 0.96 0.038 1.0
## HEIGHT    0.48 -0.59 0.58 0.420 1.9
## ADIPOSITY 0.90  0.28 0.89 0.107 1.2
## NECK      0.85  0.16 0.76 0.244 1.1
## CHEST     0.89  0.27 0.86 0.138 1.2
## ABDOMEN   0.87  0.33 0.86 0.136 1.3
## HIP       0.93 -0.09 0.87 0.132 1.0
## THIGH     0.87 -0.19 0.79 0.214 1.1
## KNEE      0.88 -0.09 0.79 0.212 1.0
## ANKLE     0.75 -0.32 0.66 0.336 1.4
## BICEPS    0.85 -0.02 0.72 0.281 1.0
## FOREARM   0.86 -0.10 0.74 0.258 1.0
## WRIST     0.77  0.12 0.60 0.396 1.0
## 
##                        PC1  PC2
## SS loadings           9.26 1.55
## Proportion Var        0.66 0.11
## Cumulative Var        0.66 0.77
## Proportion Explained  0.86 0.14
## Cumulative Proportion 0.86 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  200.83  with prob <  4.9e-16 
## 
## Fit based upon off diagonal values = 0.99
principal(data.scaled[,-1], nfactors = 3, rotate = "none")
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##            PC1   PC2   PC3   h2    u2 com
## AGE       0.00  0.85  0.44 0.92 0.077 1.5
## WEIGHT    0.98 -0.06  0.02 0.96 0.037 1.0
## HEIGHT    0.48 -0.59  0.51 0.84 0.155 2.9
## ADIPOSITY 0.90  0.28 -0.28 0.97 0.028 1.4
## NECK      0.85  0.16  0.09 0.76 0.236 1.1
## CHEST     0.89  0.27 -0.08 0.87 0.132 1.2
## ABDOMEN   0.87  0.33 -0.09 0.87 0.128 1.3
## HIP       0.93 -0.09 -0.11 0.88 0.120 1.0
## THIGH     0.87 -0.19 -0.30 0.87 0.125 1.3
## KNEE      0.88 -0.09  0.12 0.80 0.197 1.1
## ANKLE     0.75 -0.32  0.21 0.71 0.291 1.5
## BICEPS    0.85 -0.02 -0.19 0.76 0.245 1.1
## FOREARM   0.86 -0.10 -0.02 0.74 0.257 1.0
## WRIST     0.77  0.12  0.44 0.80 0.204 1.6
## 
##                        PC1  PC2  PC3
## SS loadings           9.26 1.55 0.95
## Proportion Var        0.66 0.11 0.07
## Cumulative Var        0.66 0.77 0.84
## Proportion Explained  0.79 0.13 0.08
## Cumulative Proportion 0.79 0.92 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  92.77  with prob <  0.00044 
## 
## Fit based upon off diagonal values = 0.99

3. Partial least squares. We fit the model using the pls package:

require("pls")
## Loading required package: pls
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
fit.pls <- plsr(BODYFAT~., data=data.scaled, scale=T, validation="CV")
summary(fit.pls)
## Data:    X dimension: 231 14 
##  Y dimension: 231 1
## Fit method: kernelpls
## Number of components considered: 14
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.002   0.7543   0.6251   0.6032   0.5948   0.5814   0.5727
## adjCV        1.002   0.7541   0.6243   0.6020   0.5931   0.5785   0.5705
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.5679   0.5680   0.5676    0.5689    0.5686    0.5712    0.5695
## adjCV   0.5662   0.5662   0.5658    0.5670    0.5668    0.5692    0.5677
##        14 comps
## CV       0.5703
## adjCV    0.5682
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X          65.39    76.34    82.27    86.76    89.31    91.97    93.80
## BODYFAT    43.99    62.95    66.50    68.50    70.87    71.23    71.37
##          8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X          94.93    95.60     96.33     97.37     98.69     99.50
## BODYFAT    71.40    71.43     71.45     71.46     71.49     71.55
##          14 comps
## X          100.00
## BODYFAT     71.59

Because the adjCV values vary from run to run, we repeated the fit several times; using two to three components appears adequate. We take the two- and three-component PLS models as candidates.
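Rather than eyeballing the adjCV column over repeated runs, the number of components can be chosen programmatically: assuming a pls version that provides selectNcomp, a one-sigma rule can be applied to the CV curve. A sketch on simulated data (with the real data, fit.pls would be passed in):

```r
library(pls)  # assumes a pls version that includes selectNcomp
set.seed(1)
# Simulated stand-in: response driven by two of five predictors.
X <- matrix(rnorm(100 * 5), 100, 5)
d <- data.frame(y = X[, 1] + 0.5 * X[, 2] + rnorm(100, sd = 0.5), X)
fit.sim <- plsr(y ~ ., data = d, scale = TRUE, validation = "CV")
selectNcomp(fit.sim, method = "onesigma")
```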

Finally, we compare the six candidate models from the three methods by k-fold cross-validation and pick the best one; by convention k is 5 or 10. To keep the CV errors on a comparable scale, we run everything on the standardized data (this would be unnecessary if we compared adjusted R-squared instead):

Load the boot package:

require(boot)
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:car':
## 
##     logit

We run 10-fold cross-validation on the lasso model 30 times and average the adjusted CV estimates:

data.lasso.scaled <- as.data.frame(scale(data.lasso, center = T, scale = T))
glm.fit <- glm(BODYFAT ~ ., data = data.lasso.scaled)
cv.err <- rep(0, 30)
for(i in 1:30){
  cv.err[i] <- cv.glm(data.lasso.scaled, glm.fit, K = 10)$delta[2]
}
plot(cv.err)
plot(cv.err)

print(mean(cv.err))
## [1] 0.309237

A function that cross-validates the principal component model; n is the number of components:

get.cv.prin <- function(n)
{
  covmatrix <- cov(data.scaled[,-1])
  eigen.vector <- eigen(covmatrix)$vectors
  eigen.vector.n <- eigen.vector[, 1:n]
  prin.data.n <- as.matrix(data.scaled[,-1]) %*% eigen.vector.n
  data.frame.n <- as.data.frame(cbind(data.scaled[,1], prin.data.n))
  cv.err <- rep(0, 30)
  for(i in 1:30) {
    # regress on all n component scores (columns V2 ... V(n+1)), not just the first two
    glm.fit <- glm(V1 ~ ., data = data.frame.n)
    cv.err[i] <- cv.glm(data.frame.n, glm.fit, K = 10)$delta[2]
  }
  plot(cv.err)
  print(mean(cv.err))
}
get.cv.prin(2)

## [1] 0.4789226
get.cv.prin(3)

## [1] 0.4793388
get.cv.prin(4)

## [1] 0.478881
get.cv.prin(5)

## [1] 0.4789116

The CV results for PLS were already shown above. In the end, surprisingly, lasso comes out ahead of principal components, which in turn beats partial least squares. I suspect the cross-validation section has problems and is not done well; treat this as a first draft.